Airbnb Price Analysis in European Cities

Hypotheses:

  • Airbnb rentals are cheaper on weekdays than on weekends;
  • Good overall Airbnb ratings depend directly on the cleanliness rating;
  • A pricing model can be built from the available Airbnb data (price, person capacity, number of bedrooms, distance from the city centre, distance from the metro, cleanliness rating and overall rating).

Objectives:

Analyse European cities by Airbnb price on weekends and on weekdays;

Determine how the overall Airbnb rating depends on the cleanliness rating;

Build a model for estimating Airbnb prices.

Dataset

Data source:
https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities

Cities analysed: Amsterdam, Athens, Barcelona, Berlin, Budapest, Lisbon, London, Paris, Rome, Vienna

Columns available in each city's weekday and weekend dataset:

realSum - the total price of the Airbnb listing;
room_type - private/shared/entire home/apt;
room_shared - whether the room is shared or not;
room_private - whether the room is private or not;
person_capacity - the maximum number of people that can stay in the room;
host_is_superhost - boolean value indicating if host is a superhost or not;
multi - indicator whether listing is for multiple rooms or not;
biz - indicator whether listing is for business purposes or not;
cleanliness_rating - the cleanliness rating of the listing;
guest_satisfaction_overall - overall rating from guests comparing all listings offered by host;
bedrooms - the number of bedrooms in the listing;
dist - distance from city center;
metro_dist - the distance from the nearest metro station;
lng & lat - coordinates for location identification.

In [2]:
import os
import glob
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()
import numpy as np
from scipy import stats
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Data loading

In this step the downloaded Airbnb datasets for each city are loaded (10 CSV files with weekday data and 10 CSV files with weekend data). A for loop reads the separate files and adds the columns "City" and "Weekday", taken from each file name, and the files are then combined into one DataFrame. Unneeded columns are dropped, data types are converted, columns are renamed, and a new column "price_per_person" is created.

In [3]:
# Change the working directory to the folder where the data is stored
os.chdir("C:\\Users\\egecaite.BAIPGROUP\\Desktop\\DataEra\\Baigiamasis\\Data")

# File extension to look for
extension = 'csv'

# List all file names with that extension
all_filenames = glob.glob('*.{}'.format(extension))

# Empty list that will hold one DataFrame per file
dfs = []

# Loop over the CSV files: read each one into a DataFrame
# and collect the results in the dfs list

for file in all_filenames:
    # Read the CSV file into a DataFrame
    df = pd.read_csv(file)
    # "City" and "Weekday" are taken from the CSV file name
    city = os.path.basename(file).split('_')[0]
    weekday = os.path.basename(file).split('_')[1].split('.')[0]
    # Add the new columns "City" and "Weekday"
    df['City'] = city
    df['Weekday'] = weekday
    # Append the DataFrame to the dfs list
    dfs.append(df)

# Combine all DataFrames in dfs into a single DataFrame
airbnb = pd.concat(dfs, ignore_index=True)

# Drop the columns that are not needed
airbnb = airbnb.drop(columns=['attr_index','attr_index_norm','rest_index','rest_index_norm'])
airbnb.drop('Unnamed: 0', axis=1, inplace=True)

# Capitalise the city names
def capitalize_column(df, column):
    df[column] = df[column].str.title()
    return df

airbnb = capitalize_column(airbnb, 'City')

# Convert the 'multi' and 'biz' columns to boolean
airbnb['multi'] = airbnb['multi'].astype(bool)
airbnb['biz'] = airbnb['biz'].astype(bool)

# Rename columns to more descriptive names
airbnb.rename(columns={'realSum':'price', 'dist' : 'citycentre_dist', 'multi' : 'multiple_room', 
                             'biz' : 'business_room', 'City' : 'city', 'Weekday' : 'weekday'}, inplace=True)

# Create the new column 'price_per_person'
airbnb['price_per_person'] = airbnb['price'] / airbnb['person_capacity']

# Save the cleaned dataset
#airbnb.to_csv('airbnb_full.csv', index=False, encoding='utf-8-sig')

airbnb.head(5)
Out[3]:
price room_type room_shared room_private person_capacity host_is_superhost multiple_room business_room cleanliness_rating guest_satisfaction_overall bedrooms citycentre_dist metro_dist lng lat city weekday price_per_person
0 194.033698 Private room False True 2.0 False True False 10.0 93.0 1 5.022964 2.539380 4.90569 52.41772 Amsterdam weekdays 97.016849
1 344.245776 Private room False True 4.0 False False False 8.0 85.0 1 0.488389 0.239404 4.90005 52.37432 Amsterdam weekdays 86.061444
2 264.101422 Private room False True 2.0 False False True 9.0 87.0 1 5.748312 3.651621 4.97512 52.36103 Amsterdam weekdays 132.050711
3 433.529398 Private room False True 4.0 False False True 9.0 90.0 2 0.384862 0.439876 4.89417 52.37663 Amsterdam weekdays 108.382349
4 485.552926 Private room False True 2.0 True False False 10.0 98.0 1 0.544738 0.318693 4.90051 52.37508 Amsterdam weekdays 242.776463
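
The loading loop relies on a file-name convention: the city, an underscore, then `weekdays` or `weekends`, for example `amsterdam_weekdays.csv` (the exact names are an assumption inferred from the split logic and the output above). A minimal sketch of the same parsing that fails loudly on unexpected names could look like this; the helper `parse_listing_filename` is hypothetical and not part of the notebook.

```python
from pathlib import Path

VALID_PERIODS = {"weekdays", "weekends"}

def parse_listing_filename(filename):
    """Return (city, period) for a name such as 'amsterdam_weekdays.csv'."""
    stem = Path(filename).stem                 # e.g. 'amsterdam_weekdays'
    city, sep, period = stem.partition("_")    # split at the first underscore
    if not sep or period not in VALID_PERIODS:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return city.title(), period

print(parse_listing_filename("amsterdam_weekdays.csv"))
```

Rejecting unknown names keeps an unrelated CSV in the data folder (such as a previously saved `airbnb_full.csv`) from being merged into the combined DataFrame by accident.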

Quick data overview

In [415]:
# Number of rows and columns
airbnb.shape
Out[415]:
(51707, 18)
In [416]:
# Non-null counts and data types per column
airbnb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51707 entries, 0 to 51706
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   price                       51707 non-null  float64
 1   room_type                   51707 non-null  object 
 2   room_shared                 51707 non-null  bool   
 3   room_private                51707 non-null  bool   
 4   person_capacity             51707 non-null  float64
 5   host_is_superhost           51707 non-null  bool   
 6   multiple_room               51707 non-null  bool   
 7   business_room               51707 non-null  bool   
 8   cleanliness_rating          51707 non-null  float64
 9   guest_satisfaction_overall  51707 non-null  float64
 10  bedrooms                    51707 non-null  int64  
 11  citycentre_dist             51707 non-null  float64
 12  metro_dist                  51707 non-null  float64
 13  lng                         51707 non-null  float64
 14  lat                         51707 non-null  float64
 15  city                        51707 non-null  object 
 16  weekday                     51707 non-null  object 
 17  price_per_person            51707 non-null  float64
dtypes: bool(5), float64(9), int64(1), object(3)
memory usage: 5.4+ MB
In [417]:
# Unique cities
airbnb['city'].unique()
Out[417]:
array(['Amsterdam', 'Athens', 'Barcelona', 'Berlin', 'Budapest', 'Lisbon',
       'London', 'Paris', 'Rome', 'Vienna'], dtype=object)
In [418]:
# Unique room types
airbnb['room_type'].unique()
Out[418]:
array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)
In [419]:
# Quick summary statistics of the numeric columns
airbnb.describe()
Out[419]:
price person_capacity cleanliness_rating guest_satisfaction_overall bedrooms citycentre_dist metro_dist lng lat price_per_person
count 51707.000000 51707.000000 51707.000000 51707.000000 51707.00000 51707.000000 51707.000000 51707.000000 51707.000000 51707.000000
mean 279.879591 3.161661 9.390624 92.628232 1.15876 3.191285 0.681540 7.426068 45.671128 95.038708
std 327.948386 1.298545 0.954868 8.945531 0.62741 2.393803 0.858023 9.799725 5.249263 121.129949
min 34.779339 2.000000 2.000000 20.000000 0.00000 0.015045 0.002301 -9.226340 37.953000 8.851498
25% 148.752174 2.000000 9.000000 90.000000 1.00000 1.453142 0.248480 -0.072500 41.399510 51.672921
50% 211.343089 3.000000 10.000000 95.000000 1.00000 2.613538 0.413269 4.873000 47.506690 75.290339
75% 319.694287 4.000000 10.000000 99.000000 1.00000 4.263077 0.737840 13.518825 51.471885 114.622850
max 18545.450285 6.000000 10.000000 100.000000 10.00000 25.284557 14.273577 23.786020 52.641410 9272.725142
In [420]:
# Summary of the text columns
airbnb.describe(include='object')
Out[420]:
room_type city weekday
count 51707 51707 51707
unique 3 10 2
top Entire home/apt London weekends
freq 32648 9993 26207
In [421]:
# Number of Airbnb listings per city
airbnb.groupby('city')['price'].count().sort_values(ascending=False)
Out[421]:
city
London       9993
Rome         9027
Paris        6688
Lisbon       5763
Athens       5280
Budapest     4022
Vienna       3537
Barcelona    2833
Berlin       2484
Amsterdam    2080
Name: price, dtype: int64
In [422]:
# Number of Airbnb listings per city on weekdays and on weekends
# (pivot arguments are passed by keyword; positional arguments are rejected by pandas 2.x)
airbnb.groupby(['city', 'weekday']).agg({'room_type' : 'count'}).reset_index().pivot(
    index='city', columns='weekday', values='room_type').sort_values(by= ['weekdays', 'weekends'], ascending=[False, False]).rename(
    columns={'weekdays' : 'quantity_in_weekdays', 'weekends' : 'quantity_in_weekends'})
Out[422]:
weekday quantity_in_weekdays quantity_in_weekends
city
London 4614 5379
Rome 4492 4535
Paris 3130 3558
Lisbon 2857 2906
Athens 2653 2627
Budapest 2074 1948
Vienna 1738 1799
Barcelona 1555 1278
Berlin 1284 1200
Amsterdam 1103 977

European cities by Airbnb price

In [423]:
airbnb.sort_values(by='price', ascending = False)
Out[423]:
price room_type room_shared room_private person_capacity host_is_superhost multiple_room business_room cleanliness_rating guest_satisfaction_overall bedrooms citycentre_dist metro_dist lng lat city weekday price_per_person
3590 18545.450285 Entire home/apt False False 2.0 True False True 10.0 100.0 1 1.196536 0.381128 23.73200 37.98600 Athens weekdays 9272.725142
34803 16445.614689 Entire home/apt False False 2.0 False False False 9.0 100.0 1 4.602378 0.118665 2.29772 48.83669 Paris weekdays 8222.807345
24348 15499.894165 Entire home/apt False False 3.0 True False True 10.0 95.0 3 0.269101 0.227193 -0.13038 51.50995 London weekdays 5166.631388
48380 13664.305916 Private room False True 2.0 False False False 9.0 87.0 1 2.239501 0.414395 16.34356 48.20751 Vienna weekdays 6832.152958
50787 13656.358834 Private room False True 2.0 False False False 9.0 87.0 1 2.239486 0.414409 16.34356 48.20751 Vienna weekends 6828.179417
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5316 42.884259 Private room False True 3.0 False False True 10.0 90.0 1 0.953017 0.370606 23.72700 37.98100 Athens weekends 14.294753
15917 40.184236 Private room False True 3.0 False True False 9.0 90.0 3 8.014306 4.193543 19.13895 47.54219 Budapest weekends 13.394745
13954 39.009259 Private room False True 3.0 False True False 9.0 90.0 3 8.014301 4.193548 19.13895 47.54219 Budapest weekdays 13.003086
13884 37.129295 Entire home/apt False False 2.0 False False False 10.0 93.0 2 4.644683 0.557410 19.11517 47.50491 Budapest weekdays 18.564647
15563 34.779339 Private room False True 2.0 False True False 10.0 97.0 1 9.986018 7.847268 18.96347 47.56406 Budapest weekends 17.389670

51707 rows × 18 columns

In [4]:
# Most expensive Airbnb listings (marker size = price per person)
fig = px.scatter_mapbox(airbnb, lat='lat', lon='lng', hover_name='price_per_person', hover_data=['room_type', 'person_capacity'],
                        color_discrete_sequence=['red'], zoom=3, size='price_per_person', size_max=20, width=900)
fig.update_layout(mapbox_style='open-street-map')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
print('Highest Airbnb prices')

pyo.iplot(fig, filename='Highest Airbnb prices')
Highest Airbnb prices
In [425]:
# Mean Airbnb price by city on weekdays and on weekends
airbnb_vid_kaina = airbnb.groupby(['city', 'weekday']).agg({'price' : 'mean'}).reset_index().pivot(
    index='city', columns='weekday', values='price').sort_values(by=['weekdays', 'weekends'], ascending = [False, False]).reset_index()

airbnb_vid_kaina.plot(x= 'city', y= ['weekdays', 'weekends'], title= 'Mean Airbnb price', kind= 'bar', figsize=(
    10, 2),color=['gold', 'skyblue'])
Out[425]:
<AxesSubplot:title={'center':'Mean Airbnb price'}, xlabel='city'>
In [426]:
# Mean Airbnb price per person by city on weekdays and on weekends
airbnb_vid_kaina = airbnb.groupby(['city', 'weekday']).agg({'price_per_person' : 'mean'}).reset_index().pivot(
    index='city', columns='weekday', values='price_per_person').sort_values(by=['weekdays', 'weekends'], ascending = [False, False]).reset_index()

airbnb_vid_kaina.plot(x= 'city', y= ['weekdays', 'weekends'], title= 'Mean Airbnb price per person', kind= 'bar', figsize=(
    10, 2),color=['salmon', 'lightgreen'])
Out[426]:
<AxesSubplot:title={'center':'Mean Airbnb price per person'}, xlabel='city'>
In [427]:
airbnb_vid_kaina
Out[427]:
weekday city weekdays weekends
0 Amsterdam 194.624966 218.386276
1 Paris 140.356389 134.989334
2 London 126.515587 126.805257
3 Barcelona 104.601522 121.310802
4 Berlin 90.197117 94.640790
5 Vienna 79.038560 78.595960
6 Lisbon 74.172944 76.356491
7 Rome 64.218883 66.477167
8 Budapest 50.651854 56.691099
9 Athens 46.538078 42.694725
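
To read the first hypothesis off this table, the weekend premium can be expressed as a percentage of the weekday mean. The sketch below is plain arithmetic on the values copied from the table above; the names `mean_price_per_person` and `weekend_premium_pct` are illustrative, and comparing means this way is not a significance test.

```python
# Mean price per person as (weekdays, weekends), copied from the table above
mean_price_per_person = {
    "Amsterdam": (194.624966, 218.386276),
    "Paris": (140.356389, 134.989334),
    "London": (126.515587, 126.805257),
    "Barcelona": (104.601522, 121.310802),
    "Berlin": (90.197117, 94.640790),
    "Vienna": (79.038560, 78.595960),
    "Lisbon": (74.172944, 76.356491),
    "Rome": (64.218883, 66.477167),
    "Budapest": (50.651854, 56.691099),
    "Athens": (46.538078, 42.694725),
}

def weekend_premium_pct(weekday_mean, weekend_mean):
    """Weekend price relative to the weekday mean, in percent."""
    return (weekend_mean - weekday_mean) / weekday_mean * 100

premiums = {city: weekend_premium_pct(*means)
            for city, means in mean_price_per_person.items()}

# Cities sorted from the largest weekend premium to the largest weekend discount
for city, pct in sorted(premiums.items(), key=lambda item: item[1], reverse=True):
    print(f"{city:<10} {pct:+.1f}%")

cheaper_on_weekends = sorted(city for city, pct in premiums.items() if pct < 0)
print("Cheaper on weekends:", cheaper_on_weekends)
```

On these per-person means, seven of the ten cities are more expensive on weekends. Paris and Athens are cheaper, and Vienna is marginally cheaper, by well under 1%.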
In [428]:
# Airbnb price per person by city on weekdays and on weekends - boxplot
plt.figure(figsize=(12, 3))
ax = plt.subplot()
plt.axis([0,8,0,400])
sns.set_theme(style='ticks', palette='pastel')
sns.boxplot(x='city', y='price_per_person', hue='weekday', palette=['salmon', 'darkgrey'], 
            data=airbnb, fliersize=0.5, linewidth=1, order=airbnb.groupby(
                'city')['price_per_person'].median().sort_values(ascending=False).index)
plt.ylabel('Airbnb_price_per_person')
plt.grid(axis='y', color='lightgrey', linestyle='--', linewidth=.5)
plt.legend(loc=1)
plt.show()

Airbnb ratings

In [429]:
# Mean cleanliness rating per city
airbnb.groupby('city')['cleanliness_rating'].mean().sort_values(ascending=False)
Out[429]:
city
Athens       9.638447
Rome         9.514678
Budapest     9.477374
Vienna       9.472434
Amsterdam    9.465865
Berlin       9.461755
Lisbon       9.370640
Barcelona    9.291564
Paris        9.263606
London       9.175023
Name: cleanliness_rating, dtype: float64
In [430]:
# Mean overall guest satisfaction per city
airbnb.groupby('city')['guest_satisfaction_overall'].mean().sort_values(ascending=False)
Out[430]:
city
Athens       95.003598
Budapest     94.585281
Amsterdam    94.514423
Berlin       94.323671
Vienna       93.731128
Rome         93.122300
Paris        92.037530
Barcelona    91.109072
Lisbon       91.093875
London       90.645652
Name: guest_satisfaction_overall, dtype: float64
In [431]:
airbnb.loc[:, ['cleanliness_rating', 'guest_satisfaction_overall']].describe()
Out[431]:
cleanliness_rating guest_satisfaction_overall
count 51707.000000 51707.000000
mean 9.390624 92.628232
std 0.954868 8.945531
min 2.000000 20.000000
25% 9.000000 90.000000
50% 10.000000 95.000000
75% 10.000000 99.000000
max 10.000000 100.000000
In [432]:
# Relationship between the cleanliness rating and overall guest satisfaction
sns.regplot(data= airbnb, x= 'cleanliness_rating', y= 'guest_satisfaction_overall',
            scatter_kws={'color': 'darkkhaki'}, line_kws={'color': 'dimgrey'})
Out[432]:
<AxesSubplot:xlabel='cleanliness_rating', ylabel='guest_satisfaction_overall'>
In [433]:
# Strength of the correlation
corr, p_value = pearsonr(airbnb['cleanliness_rating'], airbnb['guest_satisfaction_overall'])

print('Correlation coefficient:', corr)
print('p-value:', p_value)
Correlation coefficient: 0.7140450220820529
p-value: 0.0
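
For reference, the coefficient reported by `pearsonr` is the covariance of the two columns divided by the product of their standard deviations. The sketch below spells that out in plain Python on toy numbers (the ratings in it are invented for illustration, not taken from the dataset). The p-value printed above as 0.0 is a floating-point underflow of a very small number, not an exact zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # The 1/n factors of covariance and variances cancel, so sums are enough
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy ratings, invented for illustration only
cleanliness = [10, 9, 8, 10, 7]
satisfaction = [98, 92, 85, 95, 70]
print(round(pearson_r(cleanliness, satisfaction), 3))
```

A coefficient near 0.71 indicates a strong linear association, but on its own it does not show that cleanliness causes the higher overall rating.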

Airbnb pricing model

Rome was chosen for building the pricing model: it has one of the largest numbers of Airbnb listings (second only to London) and its prices are more concentrated, with a lower standard deviation than the dataset as a whole (about 119 versus about 328).

Analysis of Rome

In [434]:
airbnb_Rome = airbnb[(airbnb['city'] == 'Rome')]
airbnb_Rome.sort_values(by='price_per_person', ascending = False)
Out[434]:
price room_type room_shared room_private person_capacity host_is_superhost multiple_room business_room cleanliness_rating guest_satisfaction_overall bedrooms citycentre_dist metro_dist lng lat city weekday price_per_person
47009 2311.738714 Private room False True 2.0 False False False 10.0 100.0 1 1.731672 1.054644 12.50391 41.91641 Rome weekends 1155.869357
42518 2305.192528 Private room False True 2.0 False False False 10.0 100.0 1 1.731677 1.054641 12.50391 41.91641 Rome weekdays 1152.596264
45898 1384.752063 Entire home/apt False False 2.0 False False False 9.0 90.0 1 2.678288 0.229584 12.51353 41.92346 Rome weekends 692.376032
41398 1380.777593 Entire home/apt False False 2.0 False False False 9.0 90.0 1 2.678285 0.229584 12.51353 41.92346 Rome weekdays 690.388797
40900 2418.348023 Entire home/apt False False 4.0 False False False 4.0 60.0 2 0.584549 0.377582 12.50100 41.90605 Rome weekdays 604.587006
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
40742 71.540458 Private room False True 4.0 False False True 7.0 78.0 1 3.709584 1.360518 12.47000 41.92400 Rome weekdays 17.885114
40660 71.540458 Private room False True 4.0 False False True 7.0 80.0 1 3.787374 1.463280 12.47000 41.92500 Rome weekdays 17.885114
43805 103.803801 Shared room True False 6.0 False True False 8.0 72.0 1 2.273166 0.241203 12.52224 41.91486 Rome weekends 17.300634
39311 103.803801 Shared room True False 6.0 False True False 8.0 72.0 1 2.273175 0.241213 12.52224 41.91486 Rome weekdays 17.300634
45466 102.634840 Entire home/apt False False 6.0 False False False 8.0 90.0 3 5.797237 3.082330 12.44594 41.87000 Rome weekends 17.105807

9027 rows × 18 columns

In [435]:
airbnb_Rome.describe()
Out[435]:
price person_capacity cleanliness_rating guest_satisfaction_overall bedrooms citycentre_dist metro_dist lng lat price_per_person
count 9027.000000 9027.000000 9027.000000 9027.000000 9027.000000 9027.000000 9027.000000 9027.000000 9027.000000 9027.000000
mean 205.391950 3.357372 9.514678 93.122300 1.229755 3.026982 0.819794 12.486139 41.895372 65.353404
std 118.618103 1.309052 0.808415 7.815107 0.549710 1.644095 0.631361 0.028827 0.017964 37.352868
min 46.057092 2.000000 2.000000 20.000000 0.000000 0.042789 0.011093 12.400790 41.818000 17.105807
25% 138.405069 2.000000 9.000000 91.000000 1.000000 1.880467 0.325294 12.467430 41.884000 45.005027
50% 182.591822 3.000000 10.000000 95.000000 1.000000 2.815721 0.621587 12.480000 41.897300 57.793468
75% 240.806116 4.000000 10.000000 98.000000 1.000000 4.030506 1.220111 12.505560 41.907190 76.742337
max 2418.348023 6.000000 10.000000 100.000000 5.000000 9.553819 4.147201 12.582980 41.951780 1155.869357

Relationship between the numeric variables and Airbnb price. Correlation strength.

In [436]:
# 6 subplots
fig, axs = plt.subplots(2,3, figsize=(12, 4))

# 1
sns.regplot(ax=axs[0,0], data=airbnb_Rome, x='person_capacity', y='price_per_person', 
            scatter_kws={'color': 'cadetblue'}, line_kws={'color': 'dimgrey'})
axs[0, 0].set_xlim([0, 8])
axs[0, 0].set_ylim([0, 400])

# 2
sns.regplot(ax=axs[0,1], data=airbnb_Rome, x='bedrooms', y='price_per_person', 
            scatter_kws={'color': 'salmon'}, line_kws={'color': 'dimgrey'})
axs[0, 1].set_xlim([-1, 6])
axs[0, 1].set_ylim([0, 400])

# 3
sns.regplot(ax=axs[0,2], data=airbnb_Rome, x='citycentre_dist', y='price_per_person', 
            scatter_kws={'color': 'orange'}, line_kws={'color': 'dimgrey'})
axs[0, 2].set_xlim([0, 10])
axs[0, 2].set_ylim([0, 400])

# 4
sns.regplot(ax=axs[1,0], data=airbnb_Rome, x='metro_dist', y='price_per_person', 
            scatter_kws={'color': 'skyblue'}, line_kws={'color': 'dimgrey'})
axs[1, 0].set_xlim([0, 5])
axs[1, 0].set_ylim([0, 400])

# 5
sns.regplot(ax=axs[1,1], data=airbnb_Rome, x='guest_satisfaction_overall', y='price_per_person', 
            scatter_kws={'color': 'lightgreen'}, line_kws={'color': 'dimgrey'})
axs[1, 1].set_xlim([0, 100])
axs[1, 1].set_ylim([0, 400])

# 6
sns.regplot(ax=axs[1,2], data=airbnb_Rome, x='cleanliness_rating', y='price_per_person', 
            scatter_kws={'color': 'darkkhaki'}, line_kws={'color': 'dimgrey'})
axs[1, 2].set_xlim([0, 11])
axs[1, 2].set_ylim([0, 400])

# display the plot
plt.tight_layout()
plt.show()
In [437]:
# Pearson correlation of each numeric feature with price per person
for column in ['person_capacity', 'bedrooms', 'citycentre_dist', 'metro_dist',
               'guest_satisfaction_overall', 'cleanliness_rating']:
    corr, p_value = pearsonr(airbnb_Rome[column], airbnb_Rome['price_per_person'])
    print(f'Correlation coefficient {column}:', corr)
    print('p-value:', p_value)
Correlation coefficient person_capacity: -0.2868346363176606
p-value: 1.615772175093204e-170
Correlation coefficient bedrooms: -0.09064087331827733
p-value: 6.2210989072449664e-18
Correlation coefficient citycentre_dist: -0.184556414772082
p-value: 5.522114882146471e-70
Correlation coefficient metro_dist: 0.006324911120130413
p-value: 0.5479359538123383
Correlation coefficient guest_satisfaction_overall: 0.03326440792561239
p-value: 0.0015727929521942591
Correlation coefficient cleanliness_rating: 0.02574376340563548
p-value: 0.014445295353994761
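
Squaring each coefficient gives the share of the variance in `price_per_person` that a one-feature linear fit would explain, which helps set expectations for the regression that follows. The values below are copied from the output above; the dictionary names are illustrative.

```python
# Pearson r of each feature with price_per_person, copied from the output above
correlations = {
    "person_capacity": -0.2868346363176606,
    "bedrooms": -0.09064087331827733,
    "citycentre_dist": -0.184556414772082,
    "metro_dist": 0.006324911120130413,
    "guest_satisfaction_overall": 0.03326440792561239,
    "cleanliness_rating": 0.02574376340563548,
}

# r squared = share of variance explained by a single-feature linear fit
explained = {name: r ** 2 for name, r in correlations.items()}

for name, share in sorted(explained.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<28} {share:.1%}")
```

Even the strongest feature, `person_capacity`, accounts for only about 8% of the variance on its own, and the `metro_dist` coefficient is not statistically distinguishable from zero (p ≈ 0.55).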

Pricing model on the original data

In [438]:
selected_attributes= ['person_capacity', 'bedrooms', 'citycentre_dist', 'metro_dist','guest_satisfaction_overall', 
                      'cleanliness_rating' ]
In [439]:
X = airbnb_Rome[selected_attributes]
X.head(2)
Out[439]:
person_capacity bedrooms citycentre_dist metro_dist guest_satisfaction_overall cleanliness_rating
39143 2.0 1 2.978468 1.595733 95.0 10.0
39144 2.0 1 0.935371 0.649269 80.0 9.0
In [440]:
y= airbnb_Rome['price_per_person']
y.head(2)
Out[440]:
39143    78.437332
39144    86.386272
Name: price_per_person, dtype: float64
In [441]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [442]:
model = LinearRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print(model.score(X_test, y_test))
0.11247942648800247
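
The number printed by `model.score` for a `LinearRegression` is the coefficient of determination R², not a classification-style accuracy. A minimal plain-Python sketch of that quantity, on invented toy prices, is shown below; `r_squared` is an illustrative helper rather than scikit-learn's code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return 1 - ss_res / ss_tot

# Toy prices, invented for illustration only
actual = [50.0, 60.0, 70.0, 80.0]
print(r_squared(actual, actual))         # perfect predictions
print(r_squared(actual, [65.0] * 4))     # always predicting the mean
```

Read this way, a score of about 0.11 means the model explains roughly 11% of the variance in the test-set prices, only slightly better than always predicting the mean.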

Pricing model after removing outliers

In [443]:
airbnb_Rome_for_outliers = airbnb_Rome.loc[:, ['price_per_person', 'person_capacity', 'bedrooms', 'citycentre_dist', 
                                                   'metro_dist', 'guest_satisfaction_overall', 'cleanliness_rating']]
z_scores = stats.zscore(airbnb_Rome_for_outliers)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
airbnb_Rome_excl_outliers = airbnb_Rome_for_outliers[filtered_entries]
In [444]:
airbnb_Rome_excl_outliers.shape
Out[444]:
(8315, 7)
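
The filter above keeps a row only if every selected column lies within three standard deviations of that column's mean, which reduces Rome from 9027 to 8315 rows (712 rows, about 7.9%, are dropped). The plain-Python sketch below illustrates the same rule on invented toy listings; `zscore_filter` is a hypothetical helper, and like `scipy.stats.zscore` by default it uses the population standard deviation.

```python
import math

def zscore_filter(rows, columns, threshold=3.0):
    """Keep rows whose absolute z-score is below the threshold in every column."""
    stats_by_col = {}
    for col in columns:
        values = [row[col] for row in rows]
        mean = sum(values) / len(values)
        # Population standard deviation (ddof=0)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        stats_by_col[col] = (mean, std)

    def is_inlier(row):
        for col, (mean, std) in stats_by_col.items():
            # A constant column has std 0; this sketch treats it as never outlying
            if std > 0 and abs(row[col] - mean) / std >= threshold:
                return False
        return True

    return [row for row in rows if is_inlier(row)]

# Toy listings, invented for illustration: 19 ordinary prices and one extreme one
listings = [{"price_per_person": 50.0 + i, "bedrooms": 1} for i in range(19)]
listings.append({"price_per_person": 1150.0, "bedrooms": 1})

kept = zscore_filter(listings, ["price_per_person", "bedrooms"])
print(len(listings), "->", len(kept))
```

Because the target `price_per_person` is among the filtered columns, the most expensive listings are removed before the model is evaluated, which is worth keeping in mind when comparing the two scores.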
In [445]:
airbnb_Rome_excl_outliers.describe()
Out[445]:
price_per_person person_capacity bedrooms citycentre_dist metro_dist guest_satisfaction_overall cleanliness_rating
count 8315.000000 8315.000000 8315.000000 8315.000000 8315.000000 8315.000000 8315.000000
mean 63.476935 3.308358 1.176909 2.957820 0.780007 93.784606 9.588455
std 25.493340 1.271312 0.457052 1.569335 0.568147 5.669632 0.590366
min 17.300634 2.000000 0.000000 0.042789 0.011093 70.000000 8.000000
25% 45.355715 2.000000 1.000000 1.854963 0.320053 91.000000 9.000000
50% 57.824640 3.000000 1.000000 2.788918 0.603644 95.000000 10.000000
75% 76.099409 4.000000 1.000000 3.961438 1.187668 98.000000 10.000000
max 177.214598 6.000000 2.000000 7.909770 2.708675 100.000000 10.000000
In [446]:
X = airbnb_Rome_excl_outliers[selected_attributes]
X.head(2)
Out[446]:
person_capacity bedrooms citycentre_dist metro_dist guest_satisfaction_overall cleanliness_rating
39143 2.0 1 2.978468 1.595733 95.0 10.0
39144 2.0 1 0.935371 0.649269 80.0 9.0
In [447]:
y= airbnb_Rome_excl_outliers['price_per_person']
y.head(2)
Out[447]:
39143    78.437332
39144    86.386272
Name: price_per_person, dtype: float64
In [448]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [449]:
model = LinearRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print(model.score(X_test, y_test))
0.2512915118846336
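
Both calls to `train_test_split` above omit `random_state`, so every run draws a different 70/30 split and the two scores are measured on different random test sets. Passing a fixed `random_state` would make the numbers repeatable. The plain-Python sketch below only illustrates the idea of a seeded split; `seeded_split` is a hypothetical helper and not scikit-learn's implementation.

```python
import random

def seeded_split(rows, test_size=0.3, seed=42):
    """Shuffle a copy of rows with a fixed seed and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # private generator, global state untouched
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train, test = seeded_split(rows)
print(len(train), len(test))

# The same seed always reproduces the same split
print(seeded_split(rows) == (train, test))
```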

Conclusions

1) In most of the European cities the average Airbnb price was higher on weekends than on weekdays. The difference was largest in Amsterdam, Barcelona and Budapest. In Paris and Athens the average price of the listings was lower on weekends than on weekdays.
2) A strong relationship was found between the Airbnb cleanliness rating and overall guest satisfaction (Pearson r ≈ 0.71): the better the cleanliness rating, the higher the overall guest satisfaction.
3) The pricing model fits poorly: its R² on the test set reached only about 0.25 even after outliers were removed (about 0.11 before).
4) The prices in the data probably refer to stays of different lengths. A suggested next step is to find Airbnb data in which prices are quoted for a uniform length of stay, which would likely improve the model.

MySQL
In [356]:
import mysql.connector
import pandas as pd
In [368]:
mydb = mysql.connector.connect(
    host = "localhost",
    port = '3306',
    user = 'root',
    password = 'xxx'
)
In [380]:
sakila = pd.read_sql('SELECT rating, SUM(length) FROM sakila.film GROUP BY rating', con=mydb)
sakila
C:\Users\egecaite.BAIPGROUP\Anaconda3\lib\site-packages\pandas\io\sql.py:762: UserWarning:

pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy

Out[380]:
rating SUM(length)
0 PG 21729.0
1 G 19767.0
2 NC-17 23778.0
3 PG-13 26859.0
4 R 23139.0
In [408]:
sakila = pd.read_sql('SELECT * FROM sakila.film WHERE rating IN ("G", "PG")', con=mydb)
sakila
Out[408]:
film_id title description release_year language_id original_language_id rental_duration rental_rate length replacement_cost rating special_features last_update
0 1 ACADEMY DINOSAUR A Epic Drama of a Feminist And a Mad Scientist... 2006 1 None 6 0.99 86 20.99 PG {Deleted Scenes, Behind the Scenes} 2006-02-15 05:03:42
1 2 ACE GOLDFINGER A Astounding Epistle of a Database Administrat... 2006 1 None 3 4.99 48 12.99 G {Deleted Scenes, Trailers} 2006-02-15 05:03:42
2 4 AFFAIR PREJUDICE A Fanciful Documentary of a Frisbee And a Lumb... 2006 1 None 5 2.99 117 26.99 G {Behind the Scenes, Commentaries} 2006-02-15 05:03:42
3 5 AFRICAN EGG A Fast-Paced Documentary of a Pastry Chef And ... 2006 1 None 6 2.99 130 22.99 G {Deleted Scenes} 2006-02-15 05:03:42
4 6 AGENT TRUMAN A Intrepid Panorama of a Robot And a Boy who m... 2006 1 None 3 2.99 169 17.99 PG {Deleted Scenes} 2006-02-15 05:03:42
... ... ... ... ... ... ... ... ... ... ... ... ... ...
367 983 WON DARES A Unbelieveable Documentary of a Teacher And a... 2006 1 None 7 2.99 105 18.99 PG {Behind the Scenes} 2006-02-15 05:03:42
368 985 WONDERLAND CHRISTMAS A Awe-Inspiring Character Study of a Waitress ... 2006 1 None 4 4.99 111 19.99 PG {Commentaries} 2006-02-15 05:03:42
369 987 WORDS HUNTER A Action-Packed Reflection of a Composer And a... 2006 1 None 3 2.99 116 13.99 PG {Deleted Scenes, Trailers, Commentaries} 2006-02-15 05:03:42
370 991 WORST BANGER A Thrilling Drama of a Madman And a Dentist wh... 2006 1 None 4 2.99 185 26.99 PG {Deleted Scenes, Behind the Scenes} 2006-02-15 05:03:42
371 996 YOUNG LANGUAGE A Unbelieveable Yarn of a Boat And a Database ... 2006 1 None 6 0.99 183 9.99 G {Trailers, Behind the Scenes} 2006-02-15 05:03:42

372 rows × 13 columns
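
The two queries above can be tried without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` module and an in-memory table holding only the five films visible at the top of the output above, so its totals differ from those of the real `sakila.film` table. It also passes the ratings as bound parameters instead of embedding them in the SQL string.

```python
import sqlite3

# First five G/PG films from the output above: (film_id, title, length, rating)
films = [
    (1, "ACADEMY DINOSAUR", 86, "PG"),
    (2, "ACE GOLDFINGER", 48, "G"),
    (4, "AFFAIR PREJUDICE", 117, "G"),
    (5, "AFRICAN EGG", 130, "G"),
    (6, "AGENT TRUMAN", 169, "PG"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER, title TEXT, length INTEGER, rating TEXT)")
conn.executemany("INSERT INTO film VALUES (?, ?, ?, ?)", films)

# Total length per rating, as in the first query
totals = conn.execute(
    "SELECT rating, SUM(length) FROM film GROUP BY rating ORDER BY rating"
).fetchall()
print(totals)

# Films with a given set of ratings, using bound parameters
wanted = ("G", "PG")
placeholders = ", ".join("?" for _ in wanted)
rows = conn.execute(
    f"SELECT title FROM film WHERE rating IN ({placeholders}) ORDER BY film_id", wanted
).fetchall()
print(len(rows))
conn.close()
```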
